Level 1

Peter Prevos

01 March 2022

Data Science for Water Professionals

Level 1

  • Introduction to data science
  • Principles of the R language
  • Analyse water quality data

Level 2

  • Collect and clean and analyse customer data
  • Cluster analysis
  • Linear regression

Level 3

  • Analysing time series
  • Writing functions
  • Principles of machine learning

Learning Objectives Level 1

  • Apply the principles of strategic data science to solve water problems
  • Write R code to load, analyse, and visualise data
  • Diagnose water quality data with descriptive statistics
  • Develop presentations, reports, and applications to share results

Course Book

lucidmanager.org/r4h2o

Program Level 1

  1. Principles of Data Science
  2. Basics of the R Language
  3. Case Study 0: Open channel flows
  4. Case Study 1: Water quality data
    • Reading CSV and Spreadsheet data
    • Descriptive Statistics
  5. Visualise data with ggplot2
  6. Data science workflow
  7. Creating data products

Introductions

Principles of Data Science

What is Data Science?

Conway Venn Diagram (drewconway.com).

Don’t Believe the Hype

Source: Harvard Business Review.

What is Good Data Science?

Vitrivian triangle (venustas, firmitas, utilitas)

What is Useful Data Science?

What is Sound Data Science?

Spreadsheets are Chaos

Code is Poetry

library(readxl)
reservoirs <- read_excel("reservoirs.xlsx", sheet = 1)

reservoirs %>%
    select(Date, River_Flow, Natural_Flow) %>%
    muate(ERV = ifelse(Natural_Flow <= 8, 4, Natural_Flow / 2),
          Date = as.Date(Date, format = "%d/%m/%Y")) %>%
    pivot_longer(-Date,
                 values_to = "Flow",
                 names_to = "Source")  %>%
    ggplot(aes(Date, Flow, col = Source)) +
    geom_line()

Basics of the R Language

Desktop version

  1. Download and install R and RStudio
  2. Download course materials
  3. Unzip folder
  4. In RStudio: File > Open Project
  5. Open r4h2o.Rproj

Cloud Version

  1. Register at cloud.rstudio.com
  2. New Project > New Project From Github Repo
    • https://github.com/pprevos/r4h2o/
  3. RStudio will download the project files

Principles of R

  • Console (REPL: Read-Eval-Print Loop)
  • Mathematical operators (+ - * ^ %% %/%)
  • Variable assignment
  • Assignment operator (<- versus =)
  • Functions: sum(), prod(), abs(), log(a, base = b)
    • Parameters
    • Use = sign
    • Auto-completion

BODMAS

3 - 3 * 6 + 2
## [1] -13

Arithmetic

diameter <- 150
pipe_area <- (pi / 4) * (diameter / 1000)^2
pipe_area
## [1] 0.01767146
sqrt(pipe_area / (pi / 4)) * 1000
## [1] 150

Vectors

complaints <- c(12, 13, 23, 45, 22, 99, 31)

sum(complaints)
prod(complaints)
abs(complaints)
exp(complaints)
factorial(complaints)
log(complaints, base = 3)
log10(complaints)

Basic Plotting 1

diameter <- 50:350
pipe_area <- (pi / 4) * (diameter / 1000)^2
plot(diameter, pipe_area, type = "l", col = "blue", main = "Pipe Section Area")
abline(v = 150, col = "grey", lty = 2)
abline(h = (pi / 4) * (150 / 1000)^2, col = "grey", lty = 2)
points(150, (pi / 4) * (150 / 1000)^2, col = "red")

Coding Principles

First computer bug (1947)

Using the Editor

Natural Language

  • Nouns (variables - meaningful
    • _ or . or camelCase
  • Verbs (functions)
  • Phrases (code)
  • Context

Bugs

  • Typos (lintr)
  • Error messes (copy and paste into search engine)
  • Help files

Scripts and Projects

  • Working Directory
  • Open Script
  • Commenting
  • Indentation

Case Study 0: Channel Flow

Discharge formula

\[Q = \frac{2}{3} C_d \sqrt{2g} \; lh^\frac{3}{2}\]

  • \(Q\): Discharge in m3/s
  • \(C_d\): Discharge coefficient
  • \(g\): Acceleration of gravity (9.81 m/s2)
  • \(l\): Crest length [m]
  • h: Head above crest level [m]

Coding Practice

Create an R script and answer:

  1. What is the flow in the channel in m3/s when the height \(h = 100\) mm?
  2. What is the average flow for these three heights: 150mm, 136mm, 75mm, in litres per second?
  3. Visualise the flow in m3/s for all heights between 50mm and 500mm

\(Q = \frac{2}{3} C_d \sqrt{2g} \; lh^\frac{3}{2}\)

  • \(Q\): Discharge in m3/s
  • \(C_d\): Discharge coefficient (\(C_d=0.62\))
  • \(g\): Acceleration of gravity (9.81 m/s2)
  • \(l\): Crest width (\(l=0.5\) m)
  • \(h\): Head above crest level [m]

Exploring Data with Tidyverse

Base R functions

  • Basic programming functions
  • Arithmetic
  • Statistics
  • Plotting
  • Extendable with functions

R Packages

  • Packages to extend base functionality
    • Distributed mainly through CRAN
    • Comprehensive R Archive Network
  • Call library to access functions: library(dplyr)
  • Or, add library name and two colons: dplyr::filter()

Tidyverse

  • Collection of R packages
  • Easy data manipulation
  • ‘syntactic sugar’

Case Study 1: Water Quality

Obtaining Data

  • Database (SQL, Oracle)
  • Desktop files (spreadsheets, CSV)
  • Web (HTML, XML, JSON)
    • API
    • Scraping

CSV files

  • Use only the top row as a header
  • Don’t use colours to indicate values
  • Prevent using spaces in column names
  • Don’t add any spreadsheet calculations
  • Every cell should be a data point or remain empty

Reading CSV files

  • readr package for CSV files (part of Tidyverse)
    • read_csv() faster alternative for read.csv()
    • Look at help text
  • File names in R
    • Unix-based (slash instead of backslash)
    • Results in data frame / tibble
library(tidyverse)
gormsey <- read_csv("casestudy1/gormsey.csv")

names(gormsey)
dim(gormsey)
ncol(gormsey)
nrow(gormsey)

head(gormsey)
str(gormsey)
glimpse(gormsey)

Variable classes

  • Integer (\({\ldots -3, -2, -1, 0, 1, 2, 3, \ldots}\))
  • Numeric (Real numbers) (\(\mathbb{R}\))
  • Complex numbers (\(a + bi\))
  • Text ("abcd")
  • Dates ("2022-03-01")
  • Factors (classifications): ("Male", "Female", "Other")
  • Boolean (TRUE, FALSE)

Variable Types

Scalar, vector and data frame / tibble (matrix)

Exploring data frames

  • df[rows, columns]
  • df$column
  • glimpse(df)
gormsey[12:13, ]
gormsey[, 4:5]
gormsey[1:2, c(2, 4:5, 6)]
gormsey[1:2, c(-1, -3, -7)]
gormsey$Date[1:6]

What is the sample number of the last sample in the Gormsey data?

Hint, use the nrow() function.

Conditionals

Compare variables

a <- 1:2
a == 1
a != 1
a >= 2

a <- c(TRUE, FALSE)
a * 2

"small" > "large" 

gormsey$Measure == "Turbidity" & gormsey$Result > 5

Filtering

# Three methods

gormsey[gormsey$Measure == "Turbidity", ]

subset(gormsey, Measure == "Turbidity")

library(dplyr)
filter(gormsey, Measure == "Turbidity")

turb5 <- filter(gormsey, Measure == "Turbidity" & Result > 5)

Counting Results

length(gormsey$Measure)
## [1] 2422
unique(gormsey$Measure)
## [1] "Chlorine Total" "E. coli"        "Turbidity"      "THM"
table(gormsey$Measure)
## 
## Chlorine Total        E. coli            THM      Turbidity 
##            760            760            168            734

dplyr counting

count(gormsey, Measure)
## # A tibble: 4 × 2
##   Measure            n
##   <chr>          <int>
## 1 Chlorine Total   760
## 2 E. coli          760
## 3 THM              168
## 4 Turbidity        734

Coding Practice

Write a script to answer these questions:

  1. How many turbidity results in all Towns, except Strathmore, are lower than 0.1 NTU?
  2. Which sample points had a non-zero E. coli count?
  3. Create a table that counts the number of THM samples per town.

Descriptive Statistics

Central Tendency

turbidity <- filter(gormsey, Measure == "Turbidity")

mean(turbidity$Result)
sum(turbidity$Result) / length(turbidity$Result)

median(turbidity$Result)

Calculating the mode of continuous data is complex and requires a specialised package.

Dispersion

min(turbidity$Result)
max(turbidity$Result)
range(turbidity$Result)
diff(range(turbidity$Result))

var(turbidity$Result)
sd(turbidity$Result)

sqrt(sum((turbidity$Result - mean(turbidity$Result))^2) / 
       (length(turbidity$Result) - 1))

Quantiles

  1. Place the observations in ascending order: \(y_1, y_2, \ldots y_n\).
  2. Calculate the rank (\(r\)) of the required percentile
  3. Interpolate between adjacent numbers: \[y_r = y_{\lfloor r \rfloor}+ (y_{r_{\lceil r \rceil}} - y_{\lfloor r \rfloor})(r - \lfloor r \rfloor)\]
    • \(y\): Observation
    • \(r\): Ranking number
    • \(\lfloor r \rfloor\): Floor or \(r\)
    • \(\lceil r \rceil\): Ceiling of \(r\)

IQR(turbidity$Result)
summary(turbidity$Result)
quantile(turbidity$Result, type = 6, probs = 0.95)

Quantiles (continued)

Grouped Analysis

Grouped Analysis

measures <- group_by(gormsey, Measure)
summarise(measures, max = max(Result))

measures_town <- group_by(gormsey, Measure, Town)
summarise(measures,
          Samples = n(),
          Mean = mean(Result),
          Median = median(Result),
          p95 = quantile(Result, type = 6, probs = 0.95))

Coding Practice

  1. What is the average number of samples taken at the sample points in Gormesey?
  2. Which sample point has breached the maximum value of 0.25 mg/l for THM most often?
  3. What is the highest 95th percentile of the turbidity for each of the towns in Gormsey, using the Weibull method?

Coding Excercise

  • Go to the Bureau of Meteorology website
  • Download rainfall data and for your nearest station
  • Determine the top five years with the highest total rainfall
  • Tips:
    • Variable names with spaces need to be between back ticks: `variable name`
    • The data has missing values. Use the na.rm = TRUE option in the sum() function
    • Use the top_n() function to list the top five years

Solution

library(tidyverse)
bom <- read_csv("casestudy1/IDCJAC0009_088110_1800_Data.csv")
bom_year <- group_by(filter(bom, Year != 2022), Year)
annual_rain <- summarise(bom_year, 
                         TotalRain = sum(`Rainfall amount (millimetres)`, 
                                         na.rm = TRUE))
top_n(annual_rain, 5)
## # A tibble: 5 × 2
##    Year TotalRain
##   <dbl>     <dbl>
## 1  1973      970 
## 2  1974      796 
## 3  1993      802.
## 4  2010      962.
## 5  2011      830.

Visualising Data

Jackson Pollock

Jackson Pollock (1952) Blue Poles number 11. Drip Painting in enamel and aluminium paint with glass on canvas (National Gallery, Canberra).

Piet Mondrian

Piet Mondrian (1928) Composition with red, yellow and blue. Oil on canvas (Municipal Museum, the Hague).

Data-Pixel Ratio

Data Story-Telling

Grammar of Graphics

  1. Data: exists at the lowest level, without which there is nothing to visualise.
  2. Aesthetics: defines which graph variables are visualised and how they look (colour, line shapes and sizes).
  3. Geometries: shapes that represent the data, such as bars, pies or lines.
  4. Facets: used to divide a visualisation into subplots.
  5. Statistics: relate to any specific transformations to summarise the data, such as trend lines.
  6. Coordinates: define how data is represented on the canvas. Mostly used in mapping.
  7. Themes: defines the non-data pixels (font sizes, backgrounds and so on)

Base Plots

  • Best for exploratory visualisations
  • Plotting functions
    • plot()
    • barplot()
    • histogram()
    • boxplot()
  • Annotation functions (add elements to the plot)

Base plotting parameters

  • pch: the plotting symbol (default is open circle)
  • lty: the line type (default is solid line), can be dashed, dotted, etc.
  • lwd: the line width, specified as an integer multiple
  • col: the plotting color, specified as a number, string, or hex code
  • xlab: character string for the x-axis label
  • ylab: character string for the y-axis label

Examples

library(tidyverse)
gormsey <- read_csv("casestudy1/gormsey.csv")
turbidity <- filter(gormsey, Measure == "Turbidity")

plot(turbidity$Date, turbidity$Result, 
     type = "l",
     xlab = "Date",
     ylab = "Result",
     main = "Turbidity measurements")
abline(h = 5, col = "red")

boxplot(log10(Result) ~ Town, data = turbidity,
        pch = 19, las = 3, col = "brown",
        main = "Turbidity measurements")
abline(h = log10(5), col = "red")

p95 <- summarise(group_by(turbidity, Town), 
                 p95 = quantile(Result, 0.95))

barplot(p95$p95, names.arg = p95$Town)
abline(h = 5, col = "red")

ggplot2 package

Examples 1

ggplot(gormsey, aes(Town))

ggplot(gormsey, aes(Measure)) + 
    geom_bar()

turbidity <- filter(gormsey, Measure == "Turbidity")

ggplot(turbidity, aes(Date, Result, col = Town)) + 
    geom_line()

Color Palettes

ColorBrewer

ggplot(gormsey, aes(Town, fill = Measure)) +
    geom_bar() +
    scale_fill_brewer(type = "qual",
                      palette = "Dark2")

library(RColorBrewer)
display.brewer.all() 

ggplot(gormsey, aes(Town, fill = Measure)) +
    geom_bar() +
    scale_fill_manual(values = c("cornflowerblue",
                                 "darkseagreen",
                                 "#ee6611",
                                 "#ccaa44"))

Facets

ggplot(turbidity, aes(Date, Result, col = Town)) + 
    geom_line() + 
    facet_wrap(~Town)

Statistics

thm <- filter(gormsey, Measure == "THM")
thm_grouped <- group_by(thm, Date)
thm_max <- summarise(thm_grouped, thm_max = max(Result))

ggplot(thm_max, aes(Date, thm_max)) + 
    geom_smooth(method = "lm") + 
    geom_line() + 
    geom_hline(yintercept = 0.25, col = "red")

Practice Task: Convert this visualisation to a faceted graph to show the trend per system.

Coordinates

ggplot(turbidity, aes(Town, Result)) + 
    geom_boxplot() + 
    scale_y_log10(name = "Samples (log)",
                  n.breaks = 10) +
    coord_flip()
  • scale_x_log10(): Logarithmic scale.
  • scale_x_discrete(): Discrete variables (names).
  • scale_x_continuous(): Continuous variables, such as measurements.
  • scale_x_date(): For displaying dates and times.

Themes

ggplot(turbidity, aes(Date, Result)) + 
    geom_area(col = "dodgerblue", fill = "dodgerblue") + 
    facet_wrap(~Town, ncol = 1) + 
    theme_void(base_size = 24)

ggplot(gormsey, aes(Measure)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90))

ggsave("casestudy1/measures.png", width = 15, height = 10, dpi = 300, units = "cm")

Combining data sets

chlorine <- filter(gormsey, Measure == "Chlorine Total")

chlorine_gr <- group_by(chlorine, Town)
  
chlorine_avg <- summarise(chlorine_gr, avg = mean(Result))

ggplot(chlorine, aes(Date, Result)) + 
  geom_line() + 
  facet_wrap(~Town) + 
  geom_hline(data = chlorine_avg, aes(yintercept = avg), col = "blue", lty = 2) + 
  theme_minimal() + 
  labs(title = "Average total chlorine leves in Gormsey",
       x = NULL, y = "Total Chlorine")

ggplo2 Cheat Sheet

https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-visualization-2.1.pdf

Coding Excercise

  • Go to the Bureau of Meteorology website: http://www.bom.gov.au/climate/data/stations/
  • Download rainfall data and for your nearest station
  • Determine the top five years with the highest total rainfall
  • Visualise the results with ggplot2
  • Tips:
    • Variable names with spaces need to be between back-ticks: ` variable name `
    • The data has missing values. Use the na.rm = TRUE option in the sum() function
    • Use the top_n() function to list the top ten years

Solution

library(tidyverse)
bom <- read_csv("casestudy1/IDCJAC0009_088110_1800_Data.csv")
bom_year <- group_by(filter(bom, Year != 2022), Year)
annual_rain <- summarise(bom_year, 
                         TotalRain = sum(`Rainfall amount (millimetres)`, 
                                         na.rm = TRUE))
top10 <- top_n(annual_rain, 5)
ggplot(annual_rain, aes(Year, TotalRain)) + 
  geom_col()

Data Science Workflow

Define

The regulator for water quality has released a new guideline that lowers the maximum value for trihalomethanes (THMs) at the customer tap to 0.20 mg/l. This report assesses the historical performance of the Gormsey water system to evaluate the risk of non-compliance, assuming no operational changes are implemented.

Load and Tidy

library(readr)
library(dplyr)
gormsey <- read_csv("casestudy1/gormsey.csv")
thm <- filter(gormsey, Measure == "THM")
glimpse(thm)
## Rows: 168
## Columns: 7
## $ Sample_No    <dbl> 608188, 618273, 620904, 629974, 623529, 638727, 659889, 6…
## $ Date         <date> 2069-01-11, 2069-01-16, 2069-01-16, 2069-01-23, 2069-01-…
## $ Sample_Point <chr> "BL_15694", "SO_12411", "SW_17608", "TA_16763", "ME_15385…
## $ Town         <chr> "Blancathey", "Southwold", "Swadlincote", "Tarnstead", "M…
## $ Measure      <chr> "THM", "THM", "THM", "THM", "THM", "THM", "THM", "THM", "…
## $ Result       <dbl> 0.00300, 0.00300, 0.02100, 0.00300, 0.02600, 0.00300, 0.0…
## $ Units        <chr> "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "mg/L", "…

Explore

Explore THM

ggplot(thm, aes(Town, Result)) + 
  geom_boxplot() + 
  geom_hline(yintercept = .2, col = "red", linetype = "longdash") + 
  scale_y_log10() + 
  coord_flip() + 
  theme_minimal() + 
  labs(title = "THM Results",
       subtitle = "Gormsey") 

Model

“All models of reality are wrong, but some are useful.”

filter(thm, Result > 0.2)
## # A tibble: 5 × 7
##   Sample_No Date       Sample_Point Town      Measure Result Units
##       <dbl> <date>     <chr>        <chr>     <chr>    <dbl> <chr>
## 1    646594 2069-02-06 ME_19428     Merton    THM      0.763 mg/L 
## 2    697871 2070-03-26 SO_12771     Southwold THM      0.475 mg/L 
## 3    626268 2070-06-12 ME_16234     Merton    THM      0.355 mg/L 
## 4    674225 2070-08-06 ME_19428     Merton    THM      0.268 mg/L 
## 5    672190 2070-12-10 ME_16234     Merton    THM      0.714 mg/L

Reflect

Further investigation required to determine the cause

Communicate (RMarkdown)

Tables

library(knitr)
thm_fail <- filter(gormsey, Measure == "THM" & Result > .25)
kable(select(thm_fail, Date, Town, Result),
      caption = "Example outut of the `kable()` function.", 
      digits = 2)
Example outut of the kable() function.
Date Town Result
2069-02-06 Merton 0.76
2070-03-26 Southwold 0.48
2070-06-12 Merton 0.36
2070-08-06 Merton 0.27
2070-12-10 Merton 0.71

Rounding Numbers

a <- sqrt(777)
round(a)
## [1] 28
round(a, 2)
## [1] 27.87
round(a, -1)
## [1] 30
floor(a)
## [1] 27
ceiling(a)
## [1] 28
signif(a, 5)
## [1] 27.875